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Abstract 



t^ Sparsity-constrained optimization has wide applicability in machine learning, statistics, and 

signal processing problems such as feature selection and Compressive Sensing. A vast body 

of work has studied the sparsity-constrained optimization from theoretical, algorithmic, 

and application aspects in the context of sparse estimation in linear models where the 

vj fidelity of the estimate is measured by the squared error. In contrast, relatively less effort 

«y~, has been made in the study of sparsity-constrained optimization in cases where nonlinear 

QQ models are involved or the cost function is not quadratic. In this paper we propose a 

■^ greedy algorithm. Gradient Support Pursuit (GraSP), to approximate sparse minima of 

'^ cost functions of arbitrary form. Should a cost function have a Stable Restricted Hessian 

fT^ (SRH) or a Stable Restricted Linearization (SRL), both of which are introduced in this 

^^ paper, our algorithm is guaranteed to produce a sparse vector within a bounded distance 

^N from the true sparse optimum. Our approach generalizes known results for quadratic cost 

. . functions that arise in sparse linear regression and Compressive Sensing. We also evaluate 

^ the performance of GraSP through numerical simulations on synthetic data, where the 

k> algorithm is employed for sparse logistic regression with and without £2-regularization. 

5—1 Keywords: Sparsity, Optimization, Compressed Sensing, Greedy Algorithm 

1. Introduction 

The demand for high-dimensional data analysis has grown significantly over the past decade 
by the emergence of applications such as social networking, bioinformatics, and mathemati- 
cal finance. In these applications data samples often have thousands of features using which 
an underlying parameter must be inferred or predicted. In many circumstances the number 
of collected samples is significantly smaller than the dimensionality of the data, rendering 
any inference from the data ill-posed. However, it is widely acknowledged that the data sets 
that need to be processed usually exhibit significant structure, which sparsity models are 
often able to capture. This structure can be exploited for robust regression and hypothesis 
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testing, model reduction and variable selection, and more efficient signal acquisition in un- 
derdetermined regimes. Estimation of parameters with sparse structure is usually cast as 
an optimization problem, formulated according to specific application requirements. Devel- 
oping techniques that are robust and computationally tractable to solve these optimization 
problems, even only approximately, is therefore critical. 

In particular, theoretical and application aspects of sparse estimation in linear models 
have been studied extensively in areas such as signal processing, machine learning, and 
statistics. However, sparse estimation in problems where nonlinear models are involved 
have received comparatively little attention. Most of the work in this area extend the use 
of the ^i-norm as a regularizer, effective to induce sparse solutions in linear regression, to 
problems with nonlinear models (see e.g., Bunea, 2008; van de Geer, 2008; Kakade et al., 
2010; Negahban et al., 2009). As a special case, logistic regression with ii and elastic net 
regularization are studied by Bunea (2008). Furthermore, Kakade et al. (2010) have stud- 
ied the accuracy of sparse estimation through £i-regularization for the exponential family 
distributions. A more general frame of study is proposed and analyzed by Negahban et al. 
(2009) where regularization with "decomposable" norms is considered in M-estimation prob- 
lems. To provide the accuracy guarantees, these works generalize the Restricted Eigenvalue 
condition (Bickel et al., 2009) to ensure that the loss function is strongly convex over a 
restriction of its domain. We would like to emphasize that these sufficient conditions gen- 
erally hold with proper constants and with high probability only if one assumes that the 
true parameter is bounded. This fact is more apparent in some of the mentioned work (e.g., 
Bunea, 2008; Kakade et al., 2010), while in some others (e.g., Negahban et al., 2009) the 
assumption is not explicitly stated. We will elaborate on this matter in §2. Tewari et al. 
(2011) also proposed a coordinate-descent type algorithm for minimization of a convex and 
smooth objective over the convex signal/parameter models introduced in (Chandrasekaran 
et al.). This formulation includes the ^i-constrained minimization as a special case, and the 
algorithm is shown to converge to the minimum in objective value similar to the standard 
results in convex optimization. 

Furthermore, Shalev-Shwartz et al. (2010) proposed a number of greedy that sparsify 
a given estimate at the cost of relatively small increase of the objective function. How- 
ever, their algorithms are not stand-alone. A generalization of Compressed Sensing is also 
proposed in (Blumensath, 2010), where the linear measurement operator is replaced by a 
nonlinear operator that applies to the sparse signal. Considering the norm of the resid- 
ual error as the objective, Blumensath (2010) shows that if the objective satisfies certain 
sufficient conditions, the sparse signal can be accurately estimated by a generalization of 
the Iterative Hard Thresholding algorithm (Blumensath and Davies, 2009). The formu- 
lation of (Blumensath, 2010), however, has a limited scope because the metric of error is 
defined using a norm. For instance, the formulation does not apply to objectives such as 
the logistic loss. More recently, Jalali et al. (2011) studied a forward-backward algorithm 
using a variant of the sufficient conditions introduced in (Negahban et al., 2009). Similar 
to our work, the main result in (Jalali et al., 2011) imposes conditions on the function as 
restricted to sparse inputs whose non-zeros are fewer than a multiple of the target sparsity 
level. The multiplier used in their results has an objective-dependent value and is never less 
than 10. Furthermore, the multiplier is important in their analysis not only for determin- 
ing the stopping condition of the algorithm, but also in the lower bound assumed for the 



minimal magnitude of the non-zero entries. In contrast, the multipher in our results is fixed 
at 4, independent of the objective function itself, and we make no assumptions about the 
magnitudes of the non-zero entries. 

This paper presents an extended version with improved guarantees of our prior work 
in (Bahmani et al., 2011), where we proposed a greedy algorithm, the Gradient Support 
Pursuit (GraSP), for sparse estimation problems that arise in applications with general 
nonlinear models. We prove the accuracy of GraSP for a class of cost functions that have a 
Stable Restricted Hessian (SRH). The SRH, introduced in (Bahmani et al., 2011), charac- 
terizes the functions whose restriction to sparse canonical subspaces have well-conditioned 
Hessian matrices. Similarly, we analyze the GraSP algorithm for non-smooth functions that 
have a Stable Restricted Linearization (SRL), a property introduced in this paper, analo- 
gous to SRH. The analysis and the guarantees for smooth and non-smooth cost functions 
are similar, except for less stringent conditions derived for smooth cost functions due to 
properties of symmetric Hessian matrices. We also prove that the SRH holds for the case 
of the ^2-penalized logistic loss function. 

Notation 

In the remainder of this paper we use the notation listed in Table 1 . 

Paper Outline 

In §2 we provide a background on sparse parameter estimation which serves as an overview 
of prior work. In §3 we state the general formulation of the problem and present our algo- 
rithm. Conditions that characterize the cost functions and the main accuracy guarantees 
of our algorithm are provided in §3 as well. The guarantees of the algorithm are proved in 
Appendices A and B. As an example where our algorithm can be applied, ^2-i'egularized 
logistic regression is studied in §4. Some experimental results for logistic regression with 
sparsity constraints are presented in §5. Finally, §6 discusses the results and concludes. 

2. Background 

We first briefly review sparse estimation problems studied in the literature. 

2.1 Sparse Linear Regression and Compressed Sensing 

The special case of sparse estimation in linear models has gained significant attention under 
the title of Compressed Sensing (CS) (Donoho, 2006). In standard CS problems the aim is 
to estimate a sparse vector x* from noisy linear measurements y = Ax* + e, where A is 
a known n x p measurement matrix with n <^ p and e is the additive measurement noise. 
To find the sparsest estimate in this underdetermined problem that is consistent with the 
measurements y one needs to solve the optimization problem 

X =argmin ||x||q s.t. ||y — AxUg < e, (1) 

where e is a given upper bound for ||e||2 (Candes et al., 2006). In the absence of noise (i.e., 
when e = 0), if x* is s-sparse (i.e., it has at most s nonzero entries) one merely needs every 



Table 1: Notation used in this paper 



Symbol 



Description 



n 



X^ 



v|x 



the set {1,2, ... ,n} for any n G N 



calligraphic letters denote sets unless stated otherwise (e.g., M (/U, o"^) 
denotes a normal distribution) 



complement of set I 



bold face small letters denote column vectors in M^ for some 6 € N 



the ^g-norm of vector v, that is ( X^j^^ \vi 



1/9 



for a real number q > 1 



the "^o-iiorm" of vector v that merely counts its nonzero entries 



depending on the context 

1. restriction of vector v to the rows indicated by indices in I, or 

2. a vector that equals v except for coordinates in I'^ where it is zero 



Vr 


the best r-term approximation of vector v 


supp (v) 


the support set (i.e., indices of the non-zero entries) of v 


M 


bold face capital letters denote matrices in M"^^ for some a, 5 G N 


M^ 


transpose of matrix M 


Mt 


pseudo-inverse of matrix M 


Mx 


restriction of matrix M to the columns enumerated by I 


l|M|| 


the operator norm of matrix M which is equal to -y/Amax (M'^M) 


I 


the identity matrix 


Px 


restriction of the identity matrix to the columns indicated by X 


1 


column vector of all ones 


E[.] 


expectation 


H/(-) 


Hessian of the function / 



2s columns of A to be linearly independent to guarantee exact recovery (Donoho and Elad, 
2003). Unfortunately, the ideal solver (1) is computationally NP-hard in general (Natarajan, 
1995) and one must seek approximate solvers instead. 

It is shown in (Candes et al., 2006) that under certain conditions, minimizing the ii- 
norm as a convex proxy for the ^o-norm yields accurate estimates of x*. The resulting 
approximate solver basically returns the solution to the convex optimization problem 

X = argmin ||x||^ s.t. ||y — Ax||2 < e, (2) 

The required conditions for approximate equivalence of (1) and (2), however, generally hold 
only if measurements are collected at a higher rate. Ideally, one merely needs n = O {s) 
measurements to estimate x*, but n = 0(s log p/s) measurements are necessary for the 
accuracy of (2) to be guaranteed. 

The convex program (2) can be solved in polynomial time using interior point methods. 
However, these methods do not scale well as the size of the problem grows. Therefore, several 
first-order convex optimization methods are developed and analyzed as more efficient alter- 
natives (see e.g.. Beck and Teboulle (2009) and Agarwal et al. (2010)). Another category 
of low-complexity algorithms in CS are the non-convex greedy pursuits including Orthog- 
onal Matching Pursuit (OMP) (Pati et al., 1993; Tropp and Gilbert, 2007), Compressive 
Sampling Matching Pursuit (CoSaMP) (Needell and Tropp, 2009), Iterative Hard Thresh- 
olding (IHT) (Blumensath and Davies, 2009), and Subspace Pursuit (Dai and Milenkovic, 
2009) to name a few. These greedy algorithms implicitly approximate the solution to the 
^o-constrained least squares problem 
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x =argmin - lly — AxIL s.t. ||x||n < s. (3) 

The main theme of these iterative algorithms is to use the residual error from the previous 
iteration to successively approximate the position of non-zero entries and estimate their 
values. These algorithms have shown to exhibit accuracy guarantees similar to those of 
convex optimization methods, though with more stringent requirements. 

As mentioned above, to guarantee accuracy of the CS algorithms the measurement ma- 
trix should meet certain conditions such as incoherence (Donoho and Huo, 2001), Restricted 
Isometry Property (RIP) (Candes et al., 2006), Nullspace Property (Cohen et al., 2009), 
etc. Among these conditions RIP is the most commonly used and the best understood 
condition. 

Matrix A is said to satisfy the RIP of order k — in its symmetric form — with constant 
(5fc, if (^fc < 1 is the smallest number that 

(1 - 5k) ||x||2 < II Ax||2 < (1 + h) ||x||2 

holds for all fc-sparse vectors x. Several CS algorithms are shown to produce accurate 
solutions provided that the measurement matrix has a sufficiently small RIP constant of 
order ck with c being a small integer. For example, solving (2) is guaranteed to yield an 
accurate estimate of s-sparse x* if 62s < \/2 — 1 (Candes, 2008). Interested readers can find 
the best known RIP-based accuracy guarantees for some of the CS algorithms in (Foucart, 
2012). 



2.2 Beyond Linear Models 

The CS reconstruction algoritlims attempt to provide a sparse vector that incurs only a 
small squared error which measures consistency of the solution versus the acquired data. 
While this measure of discrepancy is often desirable for signal processing applications, it 
is not the appropriate choice for a variety of other applications. For example, in statistics 
and machine learning the logistic loss function is also commonly used in regression and 
classification problems (see Liu et al., 2009, and references therein). Thus, it is desirable to 
develop theory and algorithms that apply to a broader class of optimization problems with 
sparsity constraints. 

The existing studies on this subject are mostly in the context of statistical estimation. 
The majority of these studies consider the cost function to be convex everywhere and rely on 
the £i-regularization as the means to induce sparsity in the solution. For example, Kakade 
et al. (2010) have shown that for the exponential family of distributions maximum likelihood 
estimation with £i-regularization yields accurate estimates of the underlying sparse param- 
eter. Furthermore, Negahban et al. have developed a unifying framework for analyzing 
statistical accuracy of M -estimators regularized by "decomposable" norms in (Negahban 
et al., 2009). In particular, in their work ^i-regularization is applied to Generalized Linear 
Models (GLM) (Dobson and Barnett, 2008) and shown to guarantee a bounded distance 
between the estimate and the true statistical parameter. To establish this error bound they 
introduced the notion of Restricted Strong Convexity (RSC), which basically requires a lower 
bound on the curvature of the cost function around the true parameter in a restricted set 
of directions. The achieved error bound in this framework is inversely proportional to this 
curvature bound. Furthermore, Agarwal et al. (2010) have studied Projected Gradient De- 
scent as a method to solve £i-constrained optimization problems and established accuracy 
guarantees using a slightly different notion of RSC and Restricted Smoothness (RSM). 

Note that the guarantees provided for majority of the £i-regularization algorithms pre- 
sume that the true parameter is bounded, albeit implicitly. For instance, the error bound 
for £i -regularized logistic regression is recognized by Bunea (2008) to be dependent on the 
true parameter (Bunea, 2008, Assumption A, Theorem 2.4, and the remark that succeeds 
them). Moreover, the result proposed by Kakade et al. (2010) implicitly requires the true 
parameter to have a sufficiently short length to allow the choice of the desirable regulariza- 
tion coefficient (Kakade et al., 2010, Theorems 4.2 and 4.5). Negahban et al. (2009) also 
assume that the true parameter is inside the unit ball to establish the required condition 
for their analysis of ^i-regularized GLM, although this restriction is not explicitly stated 
(see the longer version of Negahban et al., 2009, p. 37). We can better understand why 
restricting the length of the true parameter may generally be inevitable by viewing these 
estimation problems from the perspective of empirical processes and their convergence. The 
empirical processes, including those considered in the studies mentioned above, are generally 
good approximations of their corresponding expected process (see (Vapnik, 1998, Chapter 
5) and (van de Geer, 2000)). Therefore, if the expected process is not strongly convex over 
an unbounded, but perhaps otherwise restricted, set the corresponding empirical process 
cannot be strongly convex over the same set. This reasoning applies in many cases including 
the studies mentioned above, where it would be impossible to achieve the desired restricted 
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strong convexity properties — with high probabihty — if the true parameter is allowed to be 
unbounded. 

Furthermore, the methods that rely on the ^i-norm are known to result in sparse solu- 
tions, but, as mentioned in (Kakade et al., 2010), the sparsity of these solutions is not known 
to be optimal in general. One can intuit this fact from definitions of RSC and RSM. These 
two properties bound the curvature of the function from below and above in a restricted 
set of directions around the true optimum. For quadratic cost functions, such as squared 
error, these curvature bounds are absolute constants. As stated before, for more general 
cost functions such as the loss functions in GLMs, however, these constants will depend 
on the location of the true optimum. Consequently, depending on the location of the true 
optimum these error bounds could be extremely large, albeit finite. When error bounds 
are significantly large, the sparsity of the solution obtained by £i-regularization may not be 
satisfactory. This motivates investigation of algorithms that do not rely on ^i-norm as a 
proxy. 

3. Problem Formulation and the GraSP Algorithm 

As seen in §2.1, in standard CS the squared error /(x) = 5 ||y — Ax||2 is used to measure 
data fidelity. While this is appropriate for a large number of signal acquisition applications, 
it is not the right cost in other fields. Thus, the significant advances in CS cannot readily be 
applied in these fields when estimation or prediction of sparse parameters become necessary. 
In this paper we focus on a generalization of (3) where a generic cost function replaces the 
squared error. Specifically, for the cost function / : M^ 1— )• M, it is desirable to approximate 

argmin/(x) s.t. ||x||q < s. (4) 

We propose the Gradient Support Pursuit (GraSP) algorithm, which is inspired by and 
generalizes the CoSaMP algorithm, to approximate the solution to (4) for a broader class 
of cost functions. 

Of course, even for a simple quadratic objective, (4) can have combinatorial complexity 
and become NP-hard. However, similar to the results of CS, knowing that the cost function 
obeys certain properties allows us to obtain accurate estimates through tractable algorithms. 
To guarantee that GraSP yields accurate solutions and is a tractable algorithm, we also 
require the cost function to have certain properties that will be described in §3.2. These 
properties are analogous to and generalize the RIP in the standard CS framework. For 
smooth cost functions we introduce the notion of a Stable Restricted Hessian (SRH) and 
for non-smooth cost functions we introduce the Stable Restricted Linearization (SRL). Both 
of these properties basically bound the Bregman divergence of the cost function restricted 
to sparse canonical subspaces. However, the analysis based on the SRH is facilitated by 
matrix algebra that results in somewhat less restrictive requirements for the cost function. 

3.1 Algorithm Description 

GraSP is an iterative algorithm, summarized in Algorithm 1, that maintains and updates 
an estimate x of the sparse optimum at every iteration. The first step in each iteration, z = 
V/ (x), evaluates the gradient of the cost function at the current estimate. For nonsmooth 
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Algorithm 1: The GraSP algorithm 



input : / (•) and s 
output: X 

initialize: x = 
repeat 

compute local gradient: z = V/ (x) 

identify directions: Z = supp (z2s) 

merge supports: T = Z U supp (x) 

minimize over support: b = argmin/(x) s.t. x|^c 

prune estimate: x = b^ 
until halting condition holds 



functions, instead of the gradient we use the restricted subgradient z = Vj (x) defined in 
§3.2. Then 2s coordinates of the vector z that have the largest magnitude are chosen as 
the directions in which pursuing the minimization will be most effective. Their indices, 
denoted hy Z = supp(z2s), are then merged with the support of the current estimate to 
obtain T = Z U supp (x). The combined support is a set of at most 3s indices over which the 
function / is minimized to produce an intermediate estimate b = arg min / (x) s.t. x|^c =0. 
The estimate x is then updated as the best s-term approximation of the intermediate 
estimate b. The iterations terminate once certain condition, e.g., on the change of the 
cost function or the change of the estimated minimum from the previous iteration, holds. 

In the special case where the squared error / (x) = glly^AxHg is the cost func- 
tion, GraSP reduces to CoSaMP. Specifically, the gradient step reduces to the proxy step 
z = a""" (y — Ax) and minimization over the restricted support reduces to the constrained 
pseudoinverse step b|^ = A^y, b|^c = in CoSaMP. 

Variants Although in this paper we only analyze the standard form of GraSP outlined 
in Algorithm 1, other variants of the algorithm can also be studied. Below we list some of 
these variants. 

1. Debiasing: In this variant, instead of performing a hard thresholding on the vector 
b, the objective is minimized restricted to the support set of b^ to obtain the new 
iterate: 

X = arg min / (x) s.t. supp (x) C supp (b^) . 

X 

2. Restricted Newton Step: To reduce the computations in each iteration, the minimiza- 
tion that yields b, we can set b|^c = and take a restricted Newton step as 

h\^=St\^-K {P^Uf (x) Pr) "' X|^ , 

where k > is a step-size. Of course, here we are assuming that the restricted Hessian, 
Plj-tlf (x) P7-, is invertible. 



3. Restricted Gradient Descent: The minimization step can be relaxed even further by 
applying a restricted gradient descent. In this approach, we again set b|^c = and 

b|r = ^\r~ '^ V/(x)lr- 

Since T contains both the support set of x and the 2s-largest entries of V/ (x) , it is 
easy to show that each iteration of this alternative method is equivalent to a standard 
gradient descent followed by a hard thresholding. In particular, if the squared error 
is the cost function as in standard CS, this variant reduces to the IHT algorithm. 

3.2 Sparse Reconstruction Conditions 

In what follows we characterize the functions for which accuracy of GraSP can be guaran- 
teed. For twice continuously differentiable functions we rely on Stable Restricted Hessian 
(SRH), while for non-smooth cost functions we introduce the Stable Restricted Lineariza- 
tion (SRL). These properties that are analogous to the RIP in the standard CS framework, 
basically require that the curvature of the cost function over the sparse subspaces can be 
bounded locally from above and below such that the corresponding bounds have the same 
order. Below we provide precise definitions of these two properties. 

Definition 1 (Stable Restricted Hessian). Suppose that / is a twice continuously differen- 
tiable function whose Hessian is denoted by Hj (•). Furthermore, let 

^fc(x) =sup|a'^H/(x) A |supp(x)Usupp(A)| < fc, IIAII2 = l| (5) 



and 



5fc(x) =inf |a^H/(x) A : |supp (x) U supp (A)| < fc, ||A||2 = l} , 



(6) 



for all /c-sparse vectors x. Then / is said to have a Stable Restricted Hessian (SRH) with 
constant fij^, or in short /i^-SRH, if 1 < J" (I < fJ-k- 

Remark 1. Since the Hessian of / is symmetric, an equivalent for Definition 1 is that a twice 
continuously differentiable function / has /x^-SRH if the condition number of FjcHf (x) P^ 
is not greater than fi^ for all fc-sparse vectors x and sets /C C [p] with |supp (x) U /C| < k. 

In the special case when the cost function is the squared error as in (3), we can write 
tlf (x) = A A which is constant. The SRH condition then requires 

Bk\\A\\l < IIAAII^ < ^fcllAII^ 

K \\ \\2 — II W 2 — ^11 llz 

to hold for all A:-sparse vectors A with A^/Bk < Hk- Therefore, in this special case the SRH 
condition essentially becomes equivalent to the RIP condition. 

Remark 2. Note that the functions that satisfy the SRH are convex over canonical sparse 
subspaces, but they are not necessarily convex everywhere. The following two examples 
describe some non-convex functions that have SRH. 



Example 1. Let / (x) = ^x Qx, where Q = 2 x 11 — I. Obviously, we have Hj (x) = Q. 
Therefore, (5) and (6) determine the extreme eigenvalues across all of the k x k symmetric 
submatrices of Q. Note that the diagonal entries of Q are all equal to one, while its 
off-diagonal entries are all equal to two. Therefore, for any 1-sparse signal u we have 
u^Qu = ||u||2, meaning that / has ^i-SRH with fii = 1. However, for u = [1, —1, 0, . . . , 0] 
we have u^Qu < 0, which means that the Hessian of / is not positive semi-definite (i.e., / 
is not convex). 

Example 2. Let / (x) = ^ \\^\\2 + CxiX2 ■ ■ ■ Xk+i where the dimensionality of x is greater 
than k. It is obvious that this function is convex for fc-sparse vectors as xiX2 • • • Xk+i = 
for any fc-sparse vector. So we can easily verify that / satisfies SRH of order k. However, 
for xi = a;2 = • • • = Xk+i = t and Xj = for i > A; -|- 1 the restriction of the Hessian of / to 
indices in [k + 1] (i.e., Pj^ -|^,Hj (x) P[;j_|_i]) is a matrix with diagonal entries all equal to one 

and off-diagonal entries all equal to Ct^~^ . Let Q denote this matrix and u be a unit-norm 
vector such that (u, 1) = 0. Then it is straightforward to verify that u'^Qu = 1 — Ct^~^ , 
which can be negative for sufficiently large values of C and t. Therefore, the Hessian of / 
is not positive semi-definite everywhere, meaning that / is not convex. 

To generalize the notion of SRH to the case of nonsmooth functions, first we define the 
restricted subgradient of a function. 

Definition 2 (Restricted Subgradient). We say vector V/ (x) is a restricted subgradient 
of / : RP h^ M at point x if 

/(x + A)-/(x)>(V/(x),A) 

holds for all fc-sparse vectors A. 

Remark 3. We introduced the notion of restricted subgradient so that the restrictions im- 
posed on / are as minimal as we need. We acknowledge that the existence of restricted 
subgradients implies convexity in sparse directions, but it does not imply convexity every- 
where. 

Remark 4. Obviously, if the function / is convex everywhere, then any subgradient of / 
determines a restricted subgradient of / as well. In general one may need to invoke the 
axiom of choice to define the restricted subgradient. 

Remark 5. We drop the sparsity level from the notation as it can be understood from the 
context. 

With a slight abuse of terminology we call 

B; (x' II x) = / (x') - / (x) - (V/ (x) ,x' - x> , 

the restricted Bregman divergence of / : M^ i— >• M between points x and x' where Vj (•) 
gives a restricted subgradient of / (•). 

Definition 3 (Stable Restricted Linearization). Let x be a fc-sparse vector in W. For 
function / : M^ i— )• M we define the functions 

Ofc (x) = sup < 2^/ (x + A II x) I A 7^ and |supp (x) U supp (A)| < A; > 
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and 

/3fc (x) = inf <^ 2^/ (x + A II x) | A / and |supp (x) U supp (A)| < A; > . 

[ 11^112 ' J 

Then / (•) is said to have a Stable Restricted Linearization with constant /Xfc, or /^fe-SRL, if 
'^J' (I < /ifc for all A;-sparse vectors x. 

Remark 6. The SRH and SRL conditions are similar to various forms of the Restricted 
Strong Convexity (RSC) and Restricted Strong Smoothness (RSS) conditions (Negahban 
et al., 2009; Agarwal et al., 2010; Blumensath, 2010; JalaU et al., 2011; Zhang, 2011) in 
the sense that they all bound the curvature of the objective function over a restricted set. 
The SRL condition quantifies the curvature in terms of a (restricted) Bregman divergence 
similar to RSC and RSS. The quadratic form used in SRH can also be converted to the Breg- 
man divergence form used in RSC and RSS and vice-versa using the mean-value theorem. 
However, compared to various forms of RSC and RSS conditions SRH and SRL have some 
important distinctions. The main difference is that the bounds in SRH and SRL conditions 
are not global constants; only their ratio is required to be bounded globally. Furthermore, 
unlike the SRH and SRL conditions the variants of RSC and RSS, that are used in convex 
relaxation methods, are required to hold over a set which is strictly larger than the set of 
canonical /c-sparse vectors. 

There is also a subtle but important difference regarding the points where the curvature 
is evaluated at. Since Negahban et al. (2009) analyze a convex program, rather than an 
iterative algorithm, they only needed to invoke the RSC and RSS at a neighborhood of the 
true parameter. In contrast, the other variants of RSC and RSS (see e.g., (Agarwal et al., 
2010; Jalali et al., 2011)), as well as our SRH and SRL conditions, require the curvature 
bounds to hold uniformly over a larger set of points, thereby they are more stringent. 

3.3 Main Theorems 

Now we can state our main results regarding approximation of 

x* = argmin /(x) s.t. ||x||q < s, (7) 

using the GraSP algorithm. 

Theorem 1. Suppose that f is a twice continuously differentiable function that has fi^g- 
SRH with fi4s ^ 2 • Furthermore, suppose that for some e > we have e < B^s (u) for 
all u. Then St^^', the estimate at the i-th iteration, satisfies 



X»-x^ 



^<2-||x1l2 + ^^||V/(x*)|^||2, 



where I is the position of the 3s largest entries of V/ (x*) in magnitude. 

Remark 7. Note that this result indicates that V/ (x*) determines how accurate the es- 
timate can be. In particular, if the sparse minimum x* is sufficiently close to an uncon- 
strained minimum of / then the estimation error floor is negligible because V/ (x*) has 
small magnitude. This result is analogous to accuracy guarantees for estimation from noisy 
measurements in CS (Candes et al., 2006; Needell and Tropp, 2009). 
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Remark 8. As the derivations required to prove Theorem 1 show, the provided accuracy 
guarantee holds for any s-sparse x*, even if it does not obey (7). Obviously, for arbitrary 
choices of x*, V/ (x*)|j may have a large norm that cannot be bounded properly which im- 
plies large errors. In statistical estimation problems, often the true parameter that describes 
the data is chosen as the target parameter x* rather than the minimizer of the average loss 
function as in (7). In these problems, the approximation error || V/ (x*)|j||2 has statistical 
interpretation and can determine the statistical precision of the problem. This property is 
easy to verify in linear regression problems. We will also show this for the logistic loss as 
an example in §4. 

Nonsmooth cost functions should be treated differently, since we do not have the luxury 
of working with Hessian matrices for these type of functions. The following theorem provides 
guarantees that are similar to those of Theorem 1 for nonsmooth cost functions that satisfy 
the SRL condition. 

Theorem 2. Suppose that f is a function that is not necessarily smooth, hut it satisfies 
fi4s-SRL with fi4s < "Y • Furthermore, suppose that for P^s (■) in Definition 3 there exists 
some e > such that fi^s (x) > e holds for all As-sparse vectors x. Then x'*', the estimate 
at the i-th iteration, satisfies 



X(^) - x^ 



^^ I|x|l2 + ; l|V/(x j|2:||2' 



2 e 

where X is the position of the 3s largest entries of Vj (x*) in magnitude. 

Remark 9. Should the SRH or SRL conditions hold for the objective function, it is straight- 
forward to convert the point accuracy guarantees of Theorems 1 and 2, into accuracy guar- 
antees in terms of the objective value. First we can use SRH or SRL to bound the Bregman 
divergence, or its restricted version defined above, for points x'*^ and x*. Then we can ob- 
tain a bound for the accuracy of the objective value by invoking the results of the theorems. 
This indirect approach, however, might not lead to sharp bounds and thus we do not pursue 
the detailed analysis in this work. 

4. Example: Sparse Minimization of £2-regularized Logistic Regression 

One of the models widely used in machine learning and statistics is the logistic model. In 
this model the relation between the data, represented by a random vector a G M^, and its 
associated label, represented by a random binary variable y £ {0, 1}, is determined by the 
conditional probability 

T3 r I 1 exp(y(a,x)) 

Prjy a;x} = — --, 8 

l + exp((a,x)) 

where x denotes a parameter vector. Then, for a set of n independently drawn data samples 
{{sii,yi)}^=i the joint likelihood can be written as a function of x. To find the maximum 
likelihood estimate one should maximize this likelihood function, or equivalently minimize 
the negative log-likelihood, the logistic loss, 

1 " 
S'W = -y]log(l + exp((ai,x))) -yi(aj,x). (9) 

n ^-^ 
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It is well-known that g {■) is strictly convex for p < n provided that the associated design 
matrix, A = [ai a2 ... a„] , is full-rank. However, in many important applications (e.g., 
feature selection) the problem can be underdetermined (i.e., n < p). In these scenarios the 
logistic loss is merely convex and it does not have a unique minimum. Furthermore, it is 
possible, especially in underdetermined problems, that the observed data is linearly sepa- 
rable. In that case one can achieve arbitrarily small loss values by tending the parameters 
to infinity along certain directions. To compensate for these drawbacks the logistic loss is 
usually regularized by some penalty term (Hastie et al., 2009; Bunea, 2008). 

One of the candidates for the penalty function is the (squared) ^2-iiorm of x (i.e., ||x||2). 
Considering a positive penalty coefficient t] the regularized loss is 



/(x) = g{x) + -\\x\\l. 

For any convex g (■) this regularized loss is guaranteed to be 77-strongly convex, thus it 
has a unique minimum. Furthermore, the penalty term implicitly bounds the length of 
the minimizer thereby resolving the aforementioned problems. Nevertheless, the £2 penalty 
does not promote sparse solutions. Therefore, it is often desirable to impose an explicit 
sparsity constraint, in addition to the I2 regularizer. 

4.1 Verifying SRH for ^2-regularized logistic loss 

It is easy to show that the Hessian of the logistic loss at any point x is given by Hg(x) = 
I^A"'"AA, where A is an n x n diagonal matrix whose diagonal entries An = sech 2 {sh,^) 
with sech(-) denoting the hyperbolic secant function. Note that ^ Hg (x) ^ j^A A. 
Therefore, if H^ (x) denotes the Hessian of the ^2-regularized logistic loss, we have 

Vx,A ?7||A||2 < A'^H^(x) A < — ||AA||2 + ??||A||2. (10) 

To verify SRH, the upper and lower bounds achieved at /c-sparse vectors A are of particular 
interest. It only remains to find an appropriate upper bound for ||AA||2 in terms of || AII2. 
To this end we use the following result on Chernoff bounds for random matrices due to 
Tropp (2011). 

Theorem 3 (Matrix Chernoff (Tropp, 2011)). Consider a finite sequence {Mj} of k x k, 
independent, random, self-adjoint matrices that satisfy 

Mj )^ and Amax (Mj) < R almost surely. 

Let 9m^^ := Amax (EiE [Mi]). Then for r > 0, 



(j^M,j > (l+r)0n,axi <A;exp 



Pr<;Amax( >.M, ) > (1 + r) 0n,ax }> <A; cxp ( ^(r-(l+r)log(l + r 

As stated before, in a standard logistic model data samples {aj} are supposed to be 
independent instances of a random vector a. In order to apply Theorem 3 we need to make 
the following extra assumptions: 
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Assumption. For every J C \p\ with 1^71 = /c, 

II 1 1 2 

(i) we have 1 1 a| j- 1 L < R almost surely, and 
(ii) none of the matrices P j-IE [aa ] Pj- is the zero matrix. 
We define 6*^^^ := A^ax (P^CPj^), where C = E [aa'^] , and let 



0^„^ and Q := min 9^. 



max' 



JQ[p] , \J\=k JC[p] , \J\=k 

To simplify the notation henceforth we let h (r) = (1 + r) log (1 + r) — r. 

Corollary 1. With the above assumptions, ifn > R (log /c + /c (l + log |) — loge) / i9h{T 
for some r > and e € (0, 1), then with probability at least 1 — e the i2-regularized logistic 
loss has fik-SRH with /xfc < 1 + '^^• 

Proof For any set of k indices J^ let Mv'^ = SLi\j anlj = PXaja^^Pj-. The independence 
of the vectors aj implies that the matrix 



i=l 
n 



is a sum of n independent, random, self-adjoint matrices. Assumption (i) implies that 

/ Tx II 1 1 2 

Amax (M,- j = llailj-llo — ^ almost surely. Furthermore, we have 

Amax ( Yl ^ [^i^] ) = ^-- ( Yl ^ [P^^ia^P^] J 

= A^axfX^P;^E[a,aT]p^j 

= A^axf^P^CP^j 

= nAinax {FjCPj) 
~ ^"max- 

Hence, for any fixed index set J7 with \J'\ = k we may apply Theorem 3 for Mj = M^, 
nOH^^, and r > to obtain 



^max 



Pr|A_ (J:mA > il + r)nei\ <feexp (- "^--^ ("^ ) 
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Furthermore, we can write 

Pr {A„,ax (A^A^) > (1 + r) nO] = Pr J A^^ax f ^^ Mf J > (1 + r) n^ i 

< kexp 



nO^i^^h (r) 



R J 

( n9h{T) \ 



Note that Assumption (ii) guarantees that 6 > 0, and thus the above probabihty bound wiU 

(D 



not be vacuous for sufficiently large n. To ensure a uniform guarantee for all (?) possible 



choices of J7 we can use the union bound to obtain 



V A„,ax(A^A^)>(l+r)n0l < J^ Pr {A^ax(A^A^) > (1+r) n^ 



JQlp] JQlp] 

\J\=k ) \J\=k 



<k[P)expl "^^(") 



Pr< 



ykj \ R I 

/pe\fc / neh{T)\ 
< k [-) exp ^ —j 

= exp logfc+fc+fclog- — . 

Therefore, for e € (0, 1) and n > R (log k + k [l + log |) — loge) / ( 9h (r) 1 it follows from 
(10) that for any x and any /c-sparse A, 

ry II A||2 < A^H, (x) A < (r/ + ^9^ || A||2 

holds with probability at least 1 — e. Thus, the £2 -regularized logistic loss has an SRH 
constant /ifc < 1 + -^9 with probability 1 — e. ■ 

Remark 10. One implication of this result is that for a regime in which k and p grow 
sufficiently large while | remains constant one can achieve small failure rates provided 
that n = Q (^Rk log |). Note that R is deliberately included in the argument of the order 
function because in general R depends on k. In other words, the above analysis may require 
n = O (/c^ log I) as the sufficient number of observations. This bound is a consequence of 
using Theorem 3, but to the best of our knowledge, other results regarding the extreme 
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eigenvalues of the average of independent random PSD matrices also yield an n of the 
same order. If matrix A has certain additional properties (e.g., independent and sub- 
Gaussian entries), however, a better rate of n = Q (A; log |) can be achieved without using 
the techniques mentioned above. 

Remark 11. The analysis provided here is not specific to the ^2-regularized logistic loss and 
can be readily extended to any other £2 -regularized GLM loss whose log-partition function 
has a Lipschitz-continuous derivative. 



4.2 Bounding the approximation error 

We are going to bound || V/ (x*)|j||2 which controls the approximation error in the state- 
ment of Theorem 1. In the case of case of ^2-i'egularized logistic loss considered in this 
section we have 



V/(x) 



^ Vl + exp(-(ai,x)) 



yi] Sii + r]x. 



Denoting ^ ^^ ,\_^ ^^,v — yi hy Vi for i = 1,2, ... ,n then we can deduce 



|V/(x* 



IXII2 



1 " 

-E 



n 



Vi Siilj + 7] :x*\j 



i=l 



n 



Ajv + 7?x*|j 



1 



< - IIAtII IIvIL -1-77 IIx* 
n " " 



XII2 



<^l|Ax| 
/n 



1 " 



vf + V\\^^\x\\2^ 



where v = [f 1 f 2 • • • Wn] • Note that ViS are n independent copies of the random variable 
V = Yj- — 7377 — jyr — y that is zero- mean and always lie in the interval [—1,1]. Therefore, 
applying the Hoeffding's inequality yields 



^M ^ E ^«' ^ (1 + ^) '^^ [ ^ ^^P (-2"cV^) , 



i=l 
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where o"^ = E [v^] is the variance of v. Furthermore, using the logistic model (8) we can 



deduce 

-2 - E [r^^ 



0-,, 



V- 



E[E[t;2 I a]] 



E 



E 



(y-E[y|a])2|a 



E [var (y I a)] 
E 



< 



E 
1 



^ exp((a,x*)) 

1 + exp ((a, X*)) 1 + exp((a, X*)) 

1 

2 + exp((a, X*)) +exp(- (a,x*)) 



(because y | a ~ Bernoulli as in (8)) 



(because exp (t) + exp (— t) > 2). 



Therefore, we have - X^i^i ^f < i with high probability. As in the previous subsection one 
can also bound -^ ||Ax|| = \/ ^Xi, 
high probability we have 



|V/(x* 



(AjAj) using (11) with k = \X\ = 3s. Hence, with 
1 



\X\\2 



<-^{l + r)e + r,\\i^ 



Interestingly, this analysis can also be extended to the GLMs whose log-partition function 
V' (•) obeys < Tp" (t) < C for all t with C being a positive constant. For these models the 
approximation error can be bounded in terms of the variance of v^ = ip' ((a, x*)) — y. 



5. Experimental Results 

Synthetic Data 

Algorithms that are used for sparsity-constrained estimation or optimization often induce 
sparsity using different types of regularizations or constraints. Therefore, the optimized ob- 
jective function may vary from one algorithm to another, even though all of these algorithms 
try to estimate the same sparse parameter and sparsely optimize the same original objec- 
tive. Because of the discrepancy in the optimized objective functions it is generally difficult 
to compare performance of these algorithms. Applying algorithms on real data generally 
produces even less reliable results because of the unmanageable or unknown characteristics 
of the real data. Hence, we restrict our simulations to application of the logistic model on 
synthetically generated data, and evaluate performance of GraSP based on the value of the 
original objective and the estimation accuracy for a sparse parameter. 

In our simulations the sparse parameter of interest x* is a p = 1000 dimensional vector 
that has s = 10 nonzero entries drawn independently from the standard Gaussian distri- 
bution. An intercept c G M is also considered which is drawn independently of the other 
parameters according to the standard Gaussian distribution. Each data sample is an inde- 



pendent instance of the random vector a : 
process (Hamilton, 1994) determined by 

paj + Vl 



^i+i 



p^Zj, 



[ai , 02 , . . . , ttp] generated by an autoregressive 



for all J G [p — 1] 
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with ai ~ A/'(0, 1), Zj ~ AA(0, 1), and p G [0, 1] being the correlation parameter. The data 
model we describe and use above is identical to the experimental model used in (Agarwal 

et al., 2010), except that we adjusted the coefficients to ensure that E a^ = 1 for all j S [p]. 

The data labels, y € {0, 1} are then drawn randomly according to the Bernoulli distribution 
with 

Pr {y = I a} = 1/ (1 + exp ((a, x*) + c)) . 

We compared GraSP to the LASSO algorithm implemented in the GLMnet pack- 
age (Friedman et al., 2010), as well as the Orthogonal Matching Pursuit method dubbed 
Logit-OMP (Lozano et al., 2011). To isolate the effect of ^2-i'egularization, both LASSO 
and the basic implementation of GraSP did not consider additional ^2-i'egularization terms. 
To analyze the effect of an additional ^2-i'egularization we also evaluated the performance 
of GraSP with £2-regularized logistic loss, as well as the logistic regression with elastic 
net (i.e., mixed ^1-^2) penalty also available in the GLMnet package. We configured the 
GLMnet software to produce s-sparse solutions for a fair comparison. For the elastic net 
penalty (1 — lo) ||x||2 /2 + uj \\^\\i we considered the "mixing parameter" uj to be 0.8. For the 



^2-regularized logistic loss we considered r/ = (1 — w) \/-^^- For each choice of the number 

of measurements n between 50 and 1000 in steps of size 50, and p in the set -^ 0, |, j, -^ ^ 
we generate the data and the associated labels and apply the algorithms. The average 
performance is measured over 200 trials for each pair of {n,p). 

Fig. 1 compares the average value of the empirical logistic loss achieved by each of 
the considered algorithms for a wide range of "sampling ratio" n/p. For GraSP, the curves 
labelled by GraSP and GraSP + I2 corresponding to the cases where the algorithm is applied 
to unregularized and £2-i"egularized logistic loss, respectively. Furthermore, the results of 
GLMnet for the LASSO and the elastic net regularization are labelled by GLMnet (^1) and 
GLMnet (elastic net), respectively. The simulation result of the Logit-OMP algorithm is 
also included. To contrast the obtained results we also provided the average of empirical 
logistic loss evaluated at the true parameter and one standard deviation above and below 
this average on the plots. Furthermore, we evaluated performance of GraSP with the 
debiasing procedure described in §3.1. 

As can be seen from the figure at lower values of the sampling ratio GraSP is not 
accurate and does not seem to be converging. This behavior can be explained by the fact 
that without regularization at low sampling ratios the training data is linearly separable or 
has very few mislabelled samples. In either case, the value of the loss can vary significantly 
even in small neighborhoods. Therefore, the algorithm can become too sensitive to the 
pruning step at the end of each iteration. At larger sampling ratios, however, the loss 
from GraSP begins to decrease rapidly, becoming effectively identical to the loss at the 
true parameter for n/p > 0.7. The results show that unlike GraSP, Logit-OMP performs 
gracefully at lower sampling ratios. At higher sampling ratios, however, GraSP appears to 
yield smaller bias in the loss value. Furthermore, the difference between the loss obtained 
by the LASSO and the loss at the true parameter never drops below a certain threshold, 
although the convex method exhibits a more stable behaviour at low sampling ratios. 

Interestingly, GraSP becomes more stable at low sampling ratios when the logistic loss is 
regularized with the £2-iiorm. However, this stability comes at the cost of a bias in the loss 
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Figure 1: Comparison of the average (empirical) logistic loss at solutions obtained via 
GraSP, GraSP with ^2-penalty, LASSO, the elastic- net regularization, and Logit-OMP. The 
results of both GraSP methods with "debiasing" are also included. The average loss at the 
true parameter and one standard deviation interval around it are plotted as well. 



value at high sampling ratios that is particularly pronounced in Fig. Id. Nevertheless, for all 
of the tested values of p, at low sampling ratios GraSP-l-^2 and at high sampling ratios GraSP 
are consistently closer to the true loss value compared to the other methods. Debiasing the 
iterates of GraSP also appears to have a stabilizing effect at lower sampling ratios. For 
GraSP with I2 regularized cost, the debiasing particularly reduced the undesirable bias at 

F 2 • 

Fig. 2 illustrates the performance of the same algorithms in terms of the relative error 
||x — x*||2 / ||x*||2 where x denotes the estimate that the algorithms produce. Not sur- 
prisingly, none of the algorithms attain an arbitrarily small relative error. Furthermore, 
the parameter p does not appear to affect the performance of the algorithms significantly. 
Without the .^2-regularization, at high sampling ratios GraSP provides an estimate that has 
a comparable error versus the .^1 -regularization method. However, for mid to high sam- 
pling ratios both GraSP and GLMnet methods are outperformed by Logit-OMP. At low to 
mid sampling ratios, GraSP is unstable and does not converge to an estimate close to the 
true parameter. Logit-OMP shows similar behavior at lower sampling ratios. Performance 
of GraSP changes changes dramatically once we consider the ^2-1'egularization and/or the 



19 



10' 



10" 



* 

/I \\/ \ 


:t:g;iliF + ,. 

■■A- GraSP (debias) 

■■•■GraSP + i-! (debias) 

-•-GLMnet (/',) 

-•■ GLMnet (elastic net) 

-♦-Logit-OMP 




*Sr*— -„..... ^ i 







0.2 



0.4 0.6 

n/p 

(a) p = 



GraSP (debias) 
GraSP + )2 (debias) 
■GLMnet {l\) 
GLMnet (elastic net) 
Logit-OMP 




0.4 0.6 

n/p 

(b) p = 1/3 



eraSP 
raSP -H i-i 
GraSP (debias) 
GraSP + l-i (debias) 
GLMnet (h) 
GLMnet (elastic net) 
Logit-OMP 




(C) p = 1/2 



(d) p = ^/2 



Figure 2: Comparison of the average relative error (i.e., ||x — x*||2 / ||x*||2) in logarithmic 
scale at solutions obtained via GraSP, GraSP with .^2-penalty, LASSO, the elastic-net reg- 
ularization, and Logit-OMP. The results of both GraSP methods with "debiasing" are also 
included. 



debiasing procedure. With ^2-i'egularization, GraSP achieves better relative error com- 
pared to GLMnet and ordinary GraSP for almost the entire range of tested sampling ratios. 
Applying the debiasing procedure has improved the performance of both GraSP methods 
except at very low sampling ratios. These variants of GraSP appear to perform better than 
Logit-OMP for almost the entire range of n/p. 

Real Data 

We also conducted the same simulation on some of the data sets used in NIPS 2003 Work- 
shop on feature extraction (Guyon et al., 2003), namely the ARCENE and DEXTER data 
sets. The logistic loss values at obtained estimates are reported in Tables 2 and 3. For each 
data set we applied the sparse logistic regression for a range of sparsity level s. The columns 
indicated by "G" correspond to different variants of GraSP. Suffixes £2 and "d" indicate the 
^2-i'egularization and the debiasing are applied, respectively. The columns indicated by li 
and E-net correspond to the results of the .^i-regularization and the elastic-net regulariza- 
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Table 2: ARCENE 



s 


G 


Gd 


Gi2 


Gi2d 


h 


E-net 


Logit-OMP 


5 


5.89E+01 


5.75E-01 


2.02E+01 


5.24E-01 


5.59E-01 


6.43E-01 


2.23E-01 


10 


3.17E+02 


5.43E-01 


3.71E+01 


4.53E-01 


5.10E-01 


5.98E-01 


5.31E-07 


15 


3.38E+02 


6.40E-07 


5.94E+00 


1.42E-07 


4.86E-01 


5.29E-01 


5.31E-07 


20 


1.21E+02 


3.44E-07 


8.82E+00 


3.08E-08 


4.52E-01 


5.19E-01 


5.31E-07 


25 


9.87E+02 


1.13E-07 


4.46E+01 


1.35E-08 


4.18E-01 


4.96E-01 


5.31E-07 



Table 3: DEXTER 



s 


G 


Gd 


Gi2 


Gi2d 


h 


E-net 


Logit-OMP 


5 


7.58E+00 


3.28E-01 


3.30E+00 


2.80E-01 


5.75E-01 


6.08E-01 


2.64E-01 


10 


1.08E+00 


1.79E-01 


4.33E-01 


1.28E-01 


5.23E-01 


5.33E-01 


1.79E-01 


15 


6.06E+00 


1.71E-01 


3.35E-01 


1.17E-01 


4.88E-01 


4.98E-01 


1.16E-01 


20 


1.30E+00 


8.84E-02 


1.79E-01 


8.19E-02 


4.27E-01 


4.36E-01 


4.60E-02 


25 


1.17E+00 


2.51E-07 


2.85E-01 


1.17E-02 


3.94E-01 


4.12E-01 


4.62E-03 


30 


3.04E-01 


5.83E-07 


2.65E-01 


1.77E-07 


3.70E-01 


3.88E-01 


2.88E-07 


35 


6.22E-01 


2.08E-07 


2.68E-01 


1.19E-07 


3.47E-01 


3.72E-01 


2.14E-07 


40 


5.38E-01 


2.01E-07 


6.30E-02 


1.27E-07 


3.31E-01 


3.56E-01 


2.14E-07 


45 


3.29E-01 


2.11E-07 


1.05E-01 


1.47E-07 


3.16E-01 


3.41E-01 


2.14E-07 


50 


2.06E-01 


1.31E-07 


5.66E-02 


1.46E-07 


2.87E-01 


3.11E-01 


2.14E-07 


55 


3.61E-02 


1.20E-07 


8.40E-02 


1.31E-07 


2.80E-01 


2.89E-01 


2.14E-07 


60 


1.18E-01 


2.46E-07 


5.70E-02 


1.09E-07 


2.66E-01 


2.82E-01 


2.14E-07 


65 


1.18E-01 


7.86E-08 


2.87E-02 


9.47E-08 


2.59E-01 


2.75E-01 


2.14E-07 


70 


8.92E-02 


1.17E-07 


2.23E-02 


8.15E-08 


2.52E-01 


2.69E-01 


2.14E-07 


75 


1.03E-01 


8.54E-08 


3.93E-02 


7.94E-08 


2.45E-01 


2.69E-01 


2.14E-07 



tion methods that are performed using the GLMnet package. The last column contains the 
result of the Logit-OMP algorithm. 

The results for DEXTER data set show that GraSP variants without debiasing and the 
convex methods achieve comparable loss values in most cases, whereas the convex methods 
show significantly better performance on the ARCENE data set. Nevertheless, except for a 
few instances where Logit-OMP has the best performance, the smallest loss values in both 
data sets are attained by GraSP methods with debiasing step. 

6. Discussion and Conclusion 

In many applications understanding high dimensional data or systems that involve these 
types of data can be reduced to identification of a sparse parameter. For example, in gene 
selection problems researchers are interested in locating a few genes among thousands of 
genes that cause or contribute to a particular disease. These problems can usually be 
cast as sparsity constrained optimizations. In this paper we introduce a greedy algorithm 
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called the Gradient Support Pursuit (GraSP) as an approximate solver for a wide range of 
sparsity-constrained optimization problems. 

We provide theoretical convergence guarantees based on the notions of a Stable Re- 
stricted Hessian (SRH) for smooth cost functions and a Stable Restricted Linearization 
(SRL) for non-smooth cost functions, both of which are introduced in this paper. Our 
algorithm generalizes the well-established sparse recovery algorithm CoSaMP that merely 
applies in linear models with squared error loss. The SRH and SRL also generalize the 
well-known Restricted Isometry Property for sparse recovery to the case of cost functions 
other than the squared error. To provide a concrete example we studied the requirements 
of GraSP for ^2-i'egularized logistic loss. Using a similar approach one can verify SRH con- 
dition for loss functions that have Lipschitz-continuous gradient that incorporates a broad 
family of loss functions. 

At medium- and large-scale problems computational cost of the GraSP algorithm is 
mostly affected by the inner convex optimization step whose complexity is polynomial in s. 
On the other hand, for very large-scale problems, especially with respect to the dimension 
of the input, p, the running time of the GraSP algorithm will be dominated by evaluation 
of the function and its gradient, whose computational cost grows with p. This problem is 
common in algorithms that only have deterministic steps; even ordinary coordinate-descent 
methods have this limitation (Nesterov, 2012). Similar to improvements gained by using 
randomization in coordinate-descent methods (Nesterov, 2012), introducing randomization 
in the GraSP algorithm could reduce its computational complexity at large-scale problems. 
This extension, however, is beyond the scope of this paper and we leave it for future work. 
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Appendix A. Iteration Analysis For Smooth Cost Functions 

To analyze our algorithm we first establish a series of results on how the algorithm operates 
on its current estimate, leading to an iteration invariant property on the estimation error. 
Propositions 1 and 2 are used to prove Lemmas 1 and 2. These Lemmas then are used to 
prove Lemma 3 that provides an iteration invariant which in turn yields the main result. 

Proposition 1. Let M (t) be a matrix-valued function such that for all t G [0,1], M (t) 
is symmetric and its eigenvalues lie in interval [B (t) ,A (t)] with B (t) > 0. Then for any 
vector V we have 

I B{t)dt\ ||v||2 < I IM{t)dt\ V < I I A{t)dt\ ||v||2 . 

Proof Let Amin (•) and Amax (•) denote the smallest and largest eigenvalue functions defined 
over the set of symmetric positive-definite matrices, respectively. These functions are in 
order concave and convex. Therefore, Jensen's inequality yields 



and 



/"M(t)dt| > /"Amin(M(t))di> f B{t)dt 



M{t)dt J < / A„,ax (M(i)) dt< J A{t)dt, 



which imply the desired result. 



25 



Proposition 2. Let M (t) be a matrix-valued function such that for all t G [0, 1] M (t) is 
symmetric and its eigenvalues lie in interval [B (t) , A (t)] with B (t) > 0. If T is a subset 
of row/column indices o/ M (•) then for any vector v we have 



P/-M(t)Pr<=dt V 



vO 



, A(t)-B(t) , „ „ 

< / -^ —dt vL. 

2 " "2 

2 



Proof Since M (t) is symmetric, it is also diagonalizable. Thus, for any vector v we may 
write 

|2 ^ ,,TT\/r i'4.\ ,, ^ A i'4.\ ll-,l|2 



and thereby 



B (t) ||v||2 < v^M (i) V < A (t) ||v||2 



A it) -Bit) ^ v^(M(t)-^i(%gMl)v ^ ^(,^_^(,) 



Since M (i) ^-^^-2 — ^I is also diagonalizable, it follows from the above inequality that 



M(t)-^W±^I 



< Mt)-B{t) J „^ Tv> ^^^ _ -dTt 



M(t) 



2 . Let M it) = PJ?M (t) Pre Since M (t) is a submatrix of 
^ 2 — '^ we should have 



M(t) 



< 



M(t)-;^«±^i 



< 



A(t)-i?(t) 



(12) 



Finally, it follows from the convexity of the operator norm, Jensen's inequality, and (12) 
that 



M it) dt 



< 



M(i) 



, Ait) -Bit) ^ 
dt< / ^ ' ^ — ^di. 



To simplify notation we introduce functions 

1 



"fc (p, q) = Mfc (iq + (1 - i) p) dt 



1 

/3fc(p,q) = f Bkitci+il-t)p)dt 



Ik (p, q) = ttfc (p, q) - /3k (p, q) , 
where A^ (•) and Bk (•) are defined by (5) and (6), respectively. 
Lemma 1. Let TZ denote the set supp (x — x*). The current estimate x then satisfies 

ui- *M II ^ 74s(x,x*)+72s(£,x-^) 11^ *|| , Pf{^* )\n\z\\^ + pf{^)W\\2 

IKX-X j |^c||2 < — T^-— ||X-X ||2H 



2/32. (x,x*) 



/32s (x,X* 
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Proof Since Z = supp (z2s) and \TZ\ < 2s we have ||z|7^||2 < Hzl^Hg and thereby 

||z|7^\^||2 < ||z|2\7^||2• (13) 

Furthermore, because z = V/ (x) we can write 

lhl7^\^||2>||V/(X)|7^\2-V/(x^)|7^\^||2-||V/(x*)|7^\2||2 



V/(x'^)k\2||2 

||V/(x*)k\z||2 



I Pl\zUj (tX + (1 - t) X*) dt (X - X' 

/ 

I I Pl\zUf (tX + (1 - t) X*) P^\2dt J (x - X^) \n\z 

f Pl\zilf (tX + (1 - t) X*) F^nTedt (X - x*) Un7^ 

where we spht the active coordinates (i.e., 7^) into the sets TZ\Z and Z CiTZ to apply the 
triangle inequality and obtain the last expression. Applying Propositions 1 and 2 yields 

Ihkv^ll^ >P2s (X,x'^) ||(X-x*) l^^^ll^-I?!^ ll(X-x^) Un7^|l2-||V/ (x'^) \n\z\\^ 

>P2s (X,x*) ||(X-x*) l^^^ll^-I?!^ p_x*||2-||V/ (x^) k^^ll^ . (14) 

Similarly, we have 

h\z\n\\^ < ||V/ (X) \z\n - V/ (x*) U^^jj^ + ||V/ (x*) \z\n\\^ 



I Pl\T^ilf (tX + (1 - t) X*) PTedt j (X - X* 

< ^'^^^'"^^ l(x-X-)k||2 + ||V/(x-)U\^||^ 



\Tl 



+ ||V/(x*)b\^||2 



74s (X, X*) 



|X-X*||2 + ||V/(X'^)U\7^||2. 



(15) 



Combining (13), (14), and (15) we obtain 

^^^|^||X-x*||2+||V/(x'^)|^^||^>||zb\^||^ 

>h\n\z\\2 

>/32.(x,x^)||(x-x^)k\^|l2-^:?i^ 

-\\Vfi^'')\n\z\\,- 
Since 7^ = supp(x— x*), we have ||(X— x*) |7^\2||2 — ll(x— x*) |^c||2- Hence, 



X — X 



(X-x*) 



2=ll2 



< 



74.;(X,X'') + 72s(X,X*) 

2/32S (X,x*) 



X-X1L + 



|V/(x1k\^||2 + ||V/(x-)|^7^| 
/32. (X,x-) 
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Lemma 2. The vector b given by 

b=argmin/(x) s.t. x|7-c = (16) 

satisfies 

II . I _,|| ^ ||V/(x-)|r|l2 , 74.(b,x-) 

11^ 1^ ^Il2< ;34.(b,x-) +2/34.(b,x*) II'' 1^^112- 

Proof We have 

1 
V/(x'^)-V/(b)=yH/(tx^ + (l-t)b)dt(x'^-b). 



Furthermore, since b is the solution to (16) we must have V/ (b) \j- = 0. Therefore, 
V/ (x^) Ir = I y" PrH/ (tx^ + (1 - 1) b) dt j (x^ - b) 

/'p^Hj(tx*+(l-i)b)Prdi J (x*-b)|r 

+ 1 /"pfH/(tx*+(l-t)b)Prcdt I (x*-b)|rc. 

Since / has /i4s-SRH and \TU supp (tx* + (1 — t) b)| < 4s for all t G [0, 1], functions A^g (•) 
and B^s {•), defined using (2) and (3), exist such that we have 

S4, (tx'^ + (1 - t) b) < A^in(Pf H/ (tx*+(i-t) b) Pr) 

and 

^4. (tx'^ + (1 - t)b) > Amax(PrH/ (tx* + (l-t)b)Pr) . 
Thus, from Proposition 1 we obtain 

/34. (b, x^) < A^in j f P^Uf (tx* + {l-t) b) Prdt J 



(17) 



and 



a4s (b, x^) > Ainax / PrH/ (tx^ + (1 - t) b) Prdi . 
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This result implies that the matrix L P!j-Hj (tx* + (1 — t) b) P7"dt, henceforth denoted by 
W, is invertible and 



^ < A^in (W-1) < A^ax (W-1) < ^ 



a4s(b,x*) 



/34s (b,x*)' 



(18) 



where we used the fact that Amax (M) Amin (M ^) = 1 for any positive-definite matrix M, 
particularly for W and W~^. Therefore, by multiplying both sides of (17) by W~^ obtain 



W-iv/(x*)|r = (x*-b)|r + W-i| /'pfH/(tx*+(l-t)b)Pr=dt I x* 



iT-c 



where we also used the fact that (x* — b) I7-C = x^lfc With S* = supp (x*), using triangle 
inequality, (18), and Proposition 2 then we obtain 



|x*lr-b||2 = ll(x*-b)|^||2 



< 



W 4 y"pTH^(tx*+(l-t) b) Pr^ns*dt j x*|rcn5^ 



+ ||W-iV/(xO|r|| 



||V/(x-)|r||2 , 74s (b,x-) 

- /34.(b,X^) ^2/34,(b,X-)ll I'^^ll2: 



as desired. 



Lemma 3 (Iteration Invariant). The estimation error in the current iteration, ||x — x*||2, 
and that in the next iteration, ||bs — x*||2, are related by the inequality: 



|, _ ^11 ^ 74s(x,X*)+72s(x,X*) / 74s (b,x^ 

'^ ''"^- 2/32s(x,X*) V /34s(b,X* 



X — X 



+ 



74s(b,x-^) \ ||V/(x-^)k\^||2 + ||V/(x-)U\7e| 

+ /34.(b,x*)y' /32s (X,x*) 



2^2||V/(x*)|r||2 



/34s (b,x* 



Proof Because ^ C 7" we must have 7"*^ C ^'^. Therefore, we can write ||x*|7-c||2 

II (x — X*) |t=||2 — II (^ ~ ^*) U'=ll2- Then using Lemma 1 we obtain 



^ 74s(x,x-)+72s(£,x-) ||V/(x'^)|^\2||^ + ||V/(x*)|^7e||2 

X 'J'c Q \ j;;^-^:; j^ j-^; 1 1 X — X 1 1 ~r " 



2/32s(x,X^) 



/32s (X,X*) 



(19) 



Furthermore, 



Ih — -x-*ll < llh — -Si-*l II -I-IIy*I II 
I s -'^ II2 — II s -^ I'7~ll2 ^^11 I'7"^ll2 



< X* 



\T 



b|L + ||bs — b|L + ||x*|t-c|L< 2 ||x*|-7- — b|L + ||x 



r':|l2- 



\T 



\T''\\2 



, (20) 



where the last inequality holds because || x*|^|L < s and b^ is the best s-term approximation 
of b. Therefore, using Lemma 2, 



|bs-x'^||2 < 



/34s (b,x^) 



|V/(x*)|r||2+ 1 + 



74s (b, X* 
/34s (b,x* 



|x IrHb' 



(21) 
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Combining (19) and (21) we obtain 

.74s (x,X*) +72s(x,X*) 



b. 



-Xll2<- 



2/32s(x,X* 



l + H^.llx-x- 
Pas (b, X* 



^ 74.(b,x-) \ ||V/(x-)k\^||, + ||V/(x-)U\^||^ ^ 2 II V/ (x*) |r||2 



/34.(b,x* 



/32s (X,X^ 



/34.(b,x* 



Using the results above, we can now prove Theorem 1. 

Proof of Theorem 1. Using definition 1 it is easy to verify that for k < k' and any 
vector u we have A^ (u) < A/^/ (u) and B^ (u) > B/./ (u). Consequently, for k < k' and any 
pair of vectors p and q we have a^ (p, q) < CKfc' (p, q), /3fc (p, q) > /3fc/ (p, q), and ^Ufc < /ifc/. 
Furthermore, for any function that satisfies ;Ufc— SRH we can write 

Qfc (P, q) ^ Jo Ak (tq + (1 - t) p) dt ^ /oVfc^fc(tq+(l-t)p)dt ^ 
/3fc(p,q) /jBfc(tq+(l-t)p)dt " Jo^ 5fc (tq + (1 - t) p) dt ^'^ 

and thereby Z'' ( ) < Hk — ^- Therefore, applying Lemma 3 to the estimate in the i-th 
iterate of the algorithm shows that 



X»-x^ 



< {^J'is - i)At4, 



X 



(i-1) 



X 



+ 



2||V/(x* 



lrll2 



+ Ai4s 



/34,(b,X* 

V/(x^)k\2||2 + ||V/(x*)U\^||2 



< (^L - ^4s) 



/32s(X(^-l),X*) 

2 



X 



(i-i) 



^ + ^||V/(x*)|x||2 + ^||V/(x*)|x||2 



Applying the assumption /i4s < ^-'^ then yields 



X«-x^ 



1 
< - 

2 - 2 



X(-i)-x^ 



3 + V3 



2 e 

The theorem follows using this inequality recursively. 



I|V/ (x* 



|X|l2 



Appendix B. Iteration Analysis For Non-Smooth Cost Functions 

In this part we provide analysis of GraSP for non-smooth functions. Definition 3 basically 
states that for any fc-sparse vector x G M", a^ (x) and /S^ (x) are in order the smallest and 
largest values for which 



/3fc (x) II A||^ < B/ (x + A II x) < ak (x) || A||^ 



(22) 



holds for all vectors A G M" that satisfy |supp (x) U supp (A)| < k. By interchanging x 
and X + A in (22) and using the fact that 

Bj (x + A II x)+Bj (x II X + A) = (V/ (x + A) - V/ (x) , A) 
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one can easily deduce 



[/3fc(x+A)+/3fe(x)]||A||^<(V/(x+A)-V/(x),A)<[afc(x+A) + afc(x)]||A||2. 



(23) 



Propositions 3, 4, and 5 establish some basic inequalities regarding the restricted Breg- 
man divergence under SRL assumption. Using these inequalities we prove Lemmas 4 and 5. 
These two Lemmas are then used to prove an iteration invariant result in Lemma 6 which 
in turn is used to prove Theorem 2. 

Note In Propositions 3, 4, and 5 we assume xi and X2 are two vectors in M" such that 
|supp (xi) U supp (x2)| < r. Furthermore, we use the shorthand A = xi — X2 and 
denote supp (A) by TZ. We also denote Vj (xi) — Vj (X2) by A'. To simplify the 
notation further the shorthands ai, /?/, and 7/ are used for ai (xi,X2) := a; (xi) + 
a«(x2), A(xi,X2) := /3z(xi) + /3i(x2), and7/(xi,X2) := a; (xi,X2) - A (xi,X2), 
respectively. 



Proposition 3. Let TV he a subset ofTZ. Then the following inequalities hold. 



I7^'ll2 



(A',A|^,; 



< 1r 



A|^,||2||A||2 



I7^'ll2 
Proof Using (22) we can write 

|2 ,2 



'A', A 



7^' 



<7r||A|^,||2||A||2 



2 ,2 



/3r (xi) II A|-;^, II2 t < Bf (xi — t A\^, II xi) < ar (xi) || Al^^, ||2i 



2 ,2 



I3r (X2) II A|-;^,||2t < Bf (X2 - t A\^, II X2) < ar (X2) II A|7^,||2i 



and 



/3r.(xi) ||A-t A|^,||2 < B/(x2 + t AIt^, II xi) < ar(xi) ||A-t Al^^, 

/3r (X2) 



||A-t A 

A - t A|^,||2 < Bf (xi - t A\j^, II X2) < ar {^2) ||A - t A\j^,\\^ , 



(24) 
(25) 

(26) 
(27) 

(28) 
(29) 



where t is an arbitrary real number. Using the definition of the Bregman divergence we can 
add (26) and (27) to obtain 



^r II A|7^,||^ t^ < / (xi - t A|^,) - / (Xi) + / (X2 + t A|^,) - / (X2) + (A', A|^,> t 

< ar W A\^,\\lt'^ . (30) 

Similarly, (28) and (29) yield 

^,||A-tA|7^,||2</(xi-tA|^,)-/(xi) + /(x2+tA|^,)-/(x2) + (A',A-tA|^,> 



< ar IIA - t A 



7^'ll2 ■ 



Expanding the quadratic bounds of (31) and using (30) then we obtain 



< 7r II A 
< 7r II A 



|2 ,2 



r II "I7^'II2* + 2 ( A 



r II ^l7^'ll2 



7^'ll2 



t" 



2 a 



\'R.'\\2 



{A,A\^,))t-Pr\\A\\l + {A',A) 



(A', A). 



(A, A\^,)] t + ar- II A| 



(31) 



(32) 
(33) 
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It follows from (23), (32), and (33) that 



< 7r II A|^, II2 1^ + 2 (pr II A|^, II2 - (A, A|^,) ) t + jr II AII2 



2^2 



< 7r II AL,|L t 



2 a 



\'Jl'\\2 



(A, AL,) t + 7,||A||2. 



(34) 
(35) 



These two quadratic inequalities hold for any t G M thus their discriminants are not positive, 
i.e., 

2 



/5r||AL, 



7e'ii2 



ar||A|7^,||2 



:a', al, 



11' 



(A', AU, 



7,2||A|7,,||2||A||2<0 



7,2||A|^,||^||A||2<0 



which yield the desired result. 



Proposition 4. The following inequalities hold for TV C TZ. 



|A'|^,||^-^.(A',A|^,) 



< 7r. II AL, 



7^'ll2 11AII2 



< 7r II AL, 



n'\\2 11^112 



Proof From (22) we have 

Pr (xi) II A'|^,||'t2 < B^ (xi - t A'l^, II xi) < ar (xi) II A'l^^ll' t^ 

Pr (X2) II A'|^,||2 t' < Bf (X2 + t A'l^, II X2) < ar {^2) \\ ^'\ti,\\1 *' 

and 

(3r (xi) II A - t A'|^,||2 < Bf (x2 + t A'l^, II xi) < ar (xi) II A - t A'|^,||2 

Pr (X2) II A - t A'|^,||2 < B/ (xi - t A'l^, II X2) < ar (X2) II A - t A'|^,||2 , 

for any t € M. By subtracting the sum of (40) and (41) from that of (38) and 
obtain 

^ll^'k'll2*'-«^ll^-*^'k'll2^2(A',A'|^,>t-(A',A) 

<a.||A'|^,||^t2-^.||A-tA'|^,||^. 
Expanding the bounds of (42) then yields 

< 7r II A'l^, g t" + 2 ((A', A'|^,> - ar (A, A'|^,» t + a, || A||2 - (A', A> 
< 7r II A'|^,||^ f - 2 ((A', A'|^,> - /3, (A, A'|^,» t - /3, || A||2 + (A', A> . 

Note that (A', A'l^^,) = || A'|^,||2 and (A, A'|^,) = (Al^^, , A'). Therefore, using 
obtain 



2.2 



0<7r A'L l^f^ + 2 A'L, 



\n'\\2 



0<7r||A'L,|l t 



2.2 



n'\\2 



n'\\2 

2 



a., 



(A', A|^,>)t + 7,||A 



A'|^,||^-^,(A',A|^,))t + 7,||A||2. 



(36) 

(37) 

(38) 
(39) 

(40) 

(41) 

(39) we 

(42) 

(43) 
(44) 

(23) we 

(45) 
(46) 
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Since the right-hand sides of (45) and (46) are quadratics in t and always non-negative for 
ah values of t G M, their discriminants cannot be positive. Thus we have 



A' 



A' 



Tl'\\2 
l|2 



a,(A',A|^,>) -7,2||A'|^,||^||Af <0 



which yield the desired result. 



^^ II A'l II IIAII^ < 

ir M-* 77'9 Nil — "' 



I7^'ll2 



Corollary 2. The inequality 

||A'|^,||2 >^r||A|^,||2-7r 

holds for 7^' C 71. 

Proof It follows from (36) and (24) that 



l7^\7^' 



A/| 11-^1— 2iiAi ii2 IIa/I 11-^1 — /a'ai \i— — iiAi ii2 /a'ai 

, l-TJ/llo + ttr l|A|7e,||2 =-||^ l77'll2 + "^ (A, A|7^,)+a^ ar ||A|7^,||2-(A, AIt^, 



7^'ll2 11^112 ■ 



< -fr II A'l^,!!^ IIAII2 + ar-fr || A|^, 

Therefore, after straightforward calculations we get 

II A'|^,||2 > - (-7r IIAII2 + \2ar II A|^,||2 - 7r IIAII2I) 

> ar II A|^, II2 — 7t- IIAII2 



> /5r||AL, 



n'\\2 



It 



\n\n' 



Proposition 5. Suppose that IC is a subset ofTZ'^ with at most k elements. Then we have 

llA'l^ll^ < 7fc+r||A||2. 

Proof Using (22) for any t E M we can write 

/3fc+,.(xi)||A\||2t'<B/(xi + tA'|^ II xi) <ak+r{^i)\\A'\Jlt^ (47) 

I3k+r (X2) II A\||2 t' < Bf (X2 - t A'\^ II X2) < Ofc+r (X2) || A^H^ t' (48) 

and similarly 



(3k+r (xi) II A+t A'l^ll^ < Bf (x2-t A'l^ II xi) < ak+r (xi) ||A+t A'l^l 
Pk+r (x2) llA+iA'l^jP < Bf (xi+t A'l^ II X2) < ak+r (x2) II A+t A' 



\K\\2 ■ 



(49) 
(50) 
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By subtracting the sum of (49) and (50) from that of (47) and (48) we obtain 



|2.2 



Pk+r ^'\A\oi - ^k+r A+t A'L ^ < -2t (A', A' 



1C\\2 



(A', A) 



< ak+r II A'l^gt' - h+r ||A+i A'l^ll^ . (51) 



Note that (A', A\) = || A\||2 and (A, A\) = 0. Therefore, (23) and (51) imply 
< %+r II A'Lf t^ ^ 2 II A'L|P t + 7fc+, ||A||2 



(52) 



hold for all t & M.. Hence, as quadratic functions of t, the right-hand side of (52) cannot 
have a positive discriminant. Thus we must have 



A 'I l|4 -2 II A l|2 II A'l l|2 ^ n 
|,cll2-^fc+- 11^112 11^ Iycll2^0 



which yields the desired result. 



Lemma 4. Let IZ denote supp (x — x*). Then we have 



(X-x* 



l2<=ll2 



^ 72^(x,X*)+74^(x,X*) „^ ^ 



V/(x* 



l7^\^ 



+ 



V/(x*)|^7, 



/32.(S,X*) " "" ' /32s (X,x*) 

Proof Given that ^ = supp(z2s) and \R,\ < 2s we have ||z|^||2 < ||z|2||2. Hence 

< 



iTVyZ 



Furthermore, using Corollary 2 we can write 



izyiz 



(53) 



Inxz 



> 



i^fi^)-^fi^n)\n\z - ^f(^^)\n\z 



72s (x,x )|Kx-x jl^n^l 



>/32s(x,X*) (X-X^)|^\2 

>^2s(X,X*) (X-X'')|^^2 ^-72s(X,X*)P-X^||2- 

Similarly, using Proposition 5 we have 



V/(x* 



i7e\2 



V/(x'^)l7e\2 



(54) 



l2\7^ 



V/(^)U\7e 



< 



(V/(X)-V^(x*))|^^^ 



+ 



VWx* 



< 74s (X,X*) ||X-X*||2 + 

Combining (53), (54), and (55) then yields 



/ v^ Jiz\n 
V/(x*)l2\7e 



(55) 



74s(x,X*) ||x-X*||2 + V/ (x 



l^\7^ 



> -72s (X,X*) II (X-X*)|^n2ll2 

+/32S (X,x'^) II (X-x*)|^^|[-|| V/(x'^)|^^ 
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Note that (x — x*)\fi\z = (^ — x*)!^^ Therefore, we have 

72s(x,x-) + 74.(x,x-) VKx*)|^\^ 

II l,x X J i^c II 2 — S ^^^ T^ l|x X II 9 -r 



+ 



V/(x^)U7e 



/32s (X,X*) 



/32s (X,X*) 



Lemma 5. The vector b given by 

b = argniin / (x) s.t. xJt-c = 

X 

Sn/jqfipS IItT*! - hll < Il^/(''"*)lrll2 I /l I 74«(x*,b) \ 11 ^1 II 

•' II I / 112— /34s{x*,b) \ /34s{x*,b)y II 1/ 112 



(56) 



Proof Since b satisfies (56) we must have V/ (b)|^ = 0. Then it follows from Corollary 
2 that 

l|x*lr-b||2 = ||(x''-b)|^||2 



< 



v/(x^)i^||2 , 74.(x^b) 



+ 



/34,(x*,b) /34s(x^b) 



I-7-CII2 • 



Lemma 6. The estimation error of the current iterate {i.e., ||x — x*||2) and that of the 
next iterate {i.e., ||bs — x^Hg) are related by the inequality: 



|b,-x*|L< 14 



274^ (x*, b) \ 72<i (x, x'')+745 (x, X* 



+ 1 + 



/34s(x^b) 

274s (x*,b) 



/32s (X%x-) 



|X-X*||2 + - 



2 Vf(x 



/l^ ilrib 



/34s(x*,b) 



V/(x*)k\^ 



+ 



V/(x^ 



\z\n 



^4s(x^b) 

Proof Since T"^ C Z"^ we have ||x*|^c||2 
applying Lemma 4 yields 



/32s (X,X*) 

(x — x*)|^c||2 < II (x — x*)|2c||2- Therefore, 



, *l II ^ 72s(x,x )+744x,x ) „^ ^^^„ ^ 

I X I '7~c 1 1 2 — 7=; /'^ . \ 11-^ -^ 1 1 9 ~r 



V/(x*)|^\2 



+ 



V/(x^ 



l^-R. 



/32s(X,X*) "^ /32s(X,X'^) 

Furthermore, as showed by (20) during the proof of Lemma 3, we again have 

IIV> — Y*ll ^9IIy*I — Vill _i_IIy*I II 

Hence, it follows from Lemma 5 that 



|bs-x*||2 < 



2||V/(x^)lrll2^A^274^1xVb)^^^^^„ 



/34s(x*,b) 



/34s(x*,b) 



\T''\\2 



(57) 



(58) 
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Combining (57) and (58) yields 

274s (x*, b) \ 72s (x, X'')+74s (x, X* 



|bs-X*||2< ( 1 + 



^4s(x*,b) 



/32s (X,X*) 



x-x* L + - 



2||V/(x 



irll2 



/34s(x^b) 



+ 1 + 



274s(x^b) 
A.(x^b) 



V/(x^ 



i7e\2 



+ 



V/(x* 



\z\n 



/32s (x,x*) 



Proof of Theorem 2. Let the vectors involved in the j'-th iteration of the algorithm 

4 



be denoted by superscript (j). Given that /X4s < '^"V ^ we have 



74s (X(^) , x^) V3 - 1 , ^ , 274, (x^ bO)) 1 + V3 

-= — 7^z7—^ r < : and 1 + -= — -, ttt^ < , 

/34s(xO),x'^) - 4 /34s(x^bO))- 2 



that yield, 



^ 274^^xVb)\ 72s(£^^'\x*)+74s(£(^^x*) ^ 1 + V3 ^ 274s (x^^'^ , x^) 
/S4s(x^b)y' ;S2s xO),x-) - 2 "" ^4s(x0-),x*) 



^ 1 + V3 V3-1 

< X 

-2 2 

_ 1 
~ 2' 



Therefore, it follows from Lemma 6 that 

1 



-0+1) ^ 

X -X 



< 

2 - 2 



X(J') - x^ 



+ ^ V/ x' 

2 e II J V 



IXII2 



Applying this inequality recursively for j = 0, 1, • • • , i — 1 then yields 

6 + 2^/3 



|X-X''||2 < 2-*||x*||2 + 



|V/(x^ 



1X112 



which is the the desired result. 
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